For this problem set we will apply some of the approaches presented in ISLR for variable selection and model regularization to some of those datasets that we have worked with previously. The goal will be to see whether some of the more principled methods for model selection will allow us better understand relative variable importance, variability of predictive performance of the models, etc.
For the purposes of the preface we will use algae dataset that we used in the lecture to illustrate some of the concepts and approaches here. To do something different here in preface we will be modeling another outcome available there – AG2. The problems in the set will continue using fund raising dataset from the previous problem sets. The flow below follows closely the outline of the Labs 6.5 and 6.6 in ISLR and you are encouraged to refer to them for additional examples and details.
algaeRaw <- read.table ("coil.analysis.data.txt", header=F, sep =",", row.names =NULL, na.strings ="XXXXXXX")
colnames (algaeRaw)= c("season","size","velocity",paste0("C",1:8),paste0("AG",1:7))
algaeRaw[1:3,]
## season size velocity C1 C2 C3 C4 C5 C6 C7
## 1 winter small_ medium 8.00 9.8 60.80 6.238 578.000 105.000 170.000
## 2 spring small_ medium 8.35 8.0 57.75 1.288 370.000 428.750 558.750
## 3 autumn small_ medium 8.10 11.4 40.02 5.330 346.667 125.667 187.057
## C8 AG1 AG2 AG3 AG4 AG5 AG6 AG7
## 1 50.0 0.0 0.0 0.0 0.0 34.2 8.3 0.0
## 2 1.3 1.4 7.6 4.8 1.9 6.7 0.0 2.1
## 3 15.6 3.3 53.6 1.9 0.0 0.0 0.0 9.7
# remove cases with undefined values and three outliers:
algae <- na.omit(algaeRaw)
algae <- algae[algae[,"C4"]<max(algae[,"C4"],na.rm=TRUE)&algae[,"C3"]<max(algae[,"C3"],na.rm=TRUE)&algae[,"AG4"]<max(algae[,"AG4"],na.rm=TRUE),]
# log-transform selected attributes:
for ( iCol in 1:8 ) {
if ( iCol > 2 ) {
algae[,paste0("C",iCol)] <- log(algae[,paste0("C",iCol)])
}
if ( iCol < 8 ) {
algae[,paste0("AG",iCol)] <- log(1+algae[,paste0("AG",iCol)])
}
}
# we'll use AG2 as an outcome here:
algaeAG2 <- algae[,!colnames(algae)%in%paste0("AG",c(1,3:7))]
pairs(algaeAG2[,-(1:3)])
Assuming that we have read and pre-processed algae data (omitted observations with undefined values, log-transformed where necessary and removed egregious outliers), let’s use regsubsets from library leaps to select optimal models with the number of terms ranging from one to all variables in the dataset using each of the methods available for this function and collect corresponding model metrics (please notice that we override default value of nvmax argument and reflect on as to why we do that and use that specific value – remember that the goal here is to evaluate models up to those that include every predictor available):
summaryMetrics <- NULL
whichAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward", "seqrep") ) {
# 15 because three categorical attributes are represented by dummy variables:
rsRes <- regsubsets(AG2~.,algaeAG2,method=myMthd,nvmax=15)
summRes <- summary(rsRes)
whichAll[[myMthd]] <- summRes$which
for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
summaryMetrics <- rbind(summaryMetrics,
data.frame(method=myMthd,metric=metricName,
nvars=1:length(summRes[[metricName]]),
value=summRes[[metricName]]))
}
}
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") + theme(legend.position="top")+theme_bw()
We can see that, except for sequential replacement that a couple of times selected models by far more inferior to those selected by the rest of the methods, and backward coming up also with models a couple of times slightly worse than the rest for corresponding attribute counts (nvars=5:6), all others came up with the models of very comparable performance by every associated metric. Plotting variable membership for each of those models as captured by which attribute of the summary demonstrates that the first four variables are the same regardless of the method used to select them – C8, C1, C3 and medium size, after which variable inclusion somewhat varies by the method used for variable selection:
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in names(whichAll) ) {
image(1:nrow(whichAll[[myMthd]]),
1:ncol(whichAll[[myMthd]]),
whichAll[[myMthd]],xlab="N(vars)",ylab="",
xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
col=c("white","gray"),main=myMthd)
axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}
par(old.par)
Next, following Lab 6.5.3 in ISLR we will split our data approximately evenly into training and test, select the best subset of variables on training data, evaluate its performance on training and test and record which variables have been selected each time. First, to be able to use regsubsets output to make predictions we follow ISLR and setup predict function that can be applied to the output from regsubsets (notice .regsubsets in its name – this is how under S3 OOP framework in R methods are matched to corresponding classes – we will further down call it just by passing output from regsubsets to predict – this, in its turn, works because function regsubsets returns object of class regsubsets):
predict.regsubsets <- function (object, newdata, id, ...){
form=as.formula(object$call [[2]])
mat=model.matrix(form,newdata)
coefi=coef(object,id=id)
xvars=names (coefi)
mat[,xvars] %*% coefi
}
We are all set now to repeatedly draw training sets, choose the best set of variables on them by each of the four different methods available in regsubsets, calculate test error on the remaining samples, etc. To summarize variable selection over multiple splits of the data into training and test, we will use 3-dimensional array whichSum – third dimension corresponding to the four methods available in regsubsets. To split data into training and test we will use again sample function – those who are curious and are paying attention may want to reflect on the difference in how it is done below and how it is implemented in the Ch. 6.5.3 of ISLR and what are the consequences of that. (Hint: consider how size of training or test datasets will vary from one iteration to another in these two implementations)
dfTmp <- NULL
whichSum <- array(0,dim=c(15,16,4),
dimnames=list(NULL,colnames(model.matrix(AG2~.,algaeAG2)),
c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(algaeAG2)))
# Try each method available in regsubsets
# to select the best model of each size:
for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
rsTrain <- regsubsets(AG2~.,algaeAG2[bTrain,],nvmax=15,method=jSelect)
# Add up variable selections:
whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
# Calculate test error for each set of variables
# using predict.regsubsets implemented above:
for ( kVarSet in 1:15 ) {
# make predictions:
testPred <- predict(rsTrain,algaeAG2[!bTrain,],id=kVarSet)
# calculate MSE:
mseTest <- mean((testPred-algaeAG2[!bTrain,"AG2"])^2)
# add to data.frame for future plotting:
dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
}
}
}
# plot MSEs by training/test, number of
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)+theme_bw()
We can see clear difference in the behavior of training and test error with the increase in the number of attributes added to the model. The training error gradually decreases even if for the larger numbers of predictors included in the model the difference between median errors is small comparing to the variability of the error across multiple splits of the data into training and test. Test error demonstrates obvious decrease upon addition of the second attribute to the model followed by steady increase with the inclusion of three or more attributes in the model. Therefore we can conclude that models with three or more attributes are overfitting in this case.
By plotting average fraction of each variable inclusion in the best model of every size by each of the four methods (darker shades of gray indicate closer to unity fraction of times given variable has been included in the best subset) we can see that the selection of C1 and C8 is farily consistent, while the selection of the rest of the attributes varies across model sizes and selection methods:
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in dimnames(whichSum)[[3]] ) {
tmpWhich <- whichSum[,,myMthd] / nTries
image(1:nrow(tmpWhich),1:ncol(tmpWhich),tmpWhich,
xlab="N(vars)",ylab="",xaxt="n",yaxt="n",main=myMthd,
breaks=c(-0.1,0.1,0.25,0.5,0.75,0.9,1.1),
col=c("white","gray90","gray75","gray50","gray25","gray10"))
axis(1,1:nrow(tmpWhich),rownames(tmpWhich))
axis(2,1:ncol(tmpWhich),colnames(tmpWhich),las=2)
}
par(old.par)
Similar observations can be made using cross-validation rather than the split of the dataset into training and test that is omitted here for the purposes of brevity.
As explained in the lecture and ISLR Ch.6.6 lasso and ridge regression can be performed by glmnet function from library glmnet – its argument alpha governs the form of the shrinkage penalty, so that alpha=0 corresponds to ridge and alpha=1 – to lasso regression. The arguments to glmnet differ from those used for lm for example and require specification of the matrix of predictors and outcome separately. model.matrix is particularly helpful for specifying matrix of predictors by creating dummy variables for categorical predictors:
# -1 to get rid of intercept that glmnet knows to include:
x <- model.matrix(AG2~.,algaeAG2)[,-1]
head(algaeAG2)
## season size velocity C1 C2 C3 C4 C5 C6
## 1 winter small_ medium 8.00 9.8 4.107590 1.8306596 6.359574 4.653960
## 2 spring small_ medium 8.35 8.0 4.056123 0.2530906 5.913503 6.060874
## 3 autumn small_ medium 8.10 11.4 3.689379 1.6733512 5.848365 4.833636
## 4 spring small_ medium 8.07 4.8 4.348522 0.8337783 4.586823 4.113853
## 5 autumn small_ medium 8.06 9.0 4.013677 2.3433431 5.454038 4.064263
## 6 winter small_ high__ 8.25 13.1 4.185860 2.2244073 6.063785 2.904165
## C7 C8 AG2
## 1 5.135798 3.9120230 0.000000
## 2 6.325702 0.2623643 2.151762
## 3 5.231413 2.7472709 4.000034
## 4 4.932313 0.3364722 3.737670
## 5 4.580673 2.3513753 1.360977
## 6 4.037192 3.3463891 2.747271
# notice how it created dummy variables for categorical attributes
head(x)
## seasonspring seasonsummer seasonwinter sizemedium sizesmall_
## 1 0 0 1 0 1
## 2 1 0 0 0 1
## 3 0 0 0 0 1
## 4 1 0 0 0 1
## 5 0 0 0 0 1
## 6 0 0 1 0 1
## velocitylow___ velocitymedium C1 C2 C3 C4 C5
## 1 0 1 8.00 9.8 4.107590 1.8306596 6.359574
## 2 0 1 8.35 8.0 4.056123 0.2530906 5.913503
## 3 0 1 8.10 11.4 3.689379 1.6733512 5.848365
## 4 0 1 8.07 4.8 4.348522 0.8337783 4.586823
## 5 0 1 8.06 9.0 4.013677 2.3433431 5.454038
## 6 0 0 8.25 13.1 4.185860 2.2244073 6.063785
## C6 C7 C8
## 1 4.653960 5.135798 3.9120230
## 2 6.060874 6.325702 0.2623643
## 3 4.833636 5.231413 2.7472709
## 4 4.113853 4.932313 0.3364722
## 5 4.064263 4.580673 2.3513753
## 6 2.904165 4.037192 3.3463891
y <- algaeAG2[,"AG2"]
ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes)
Plotting output of glmnet illustrates change in the contributions of each of the predictors as amount of shrinkage changes. In ridge regression each predictor contributes more or less over the entire range of shrinkage levels.
Output of cv.glmnet shows averages and variabilities of MSE in cross-validation across different levels of regularization. lambda.min field indicates values of \(\lambda\) at which the lowest average MSE has been achieved, lambda.1se shows larger \(\lambda\) (more regularization) that has MSE 1SD (of cross-validation) higher than the minimum – this is an often recommended \(\lambda\) to use under the idea that it will be less susceptible to overfit. You may find it instructive to experiment by providing different levels of lambda other than those used by default to understand sensitivity of gv.glmnet output to them. predict depending on the value of type argument allows to access model predictions, coefficients, etc. at a given level of lambda:
cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)
cvRidgeRes$lambda.min
## [1] 0.698709
cvRidgeRes$lambda.1se
## [1] 4.929254
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -3.6636749541
## seasonspring 0.0019528920
## seasonsummer -0.0045681513
## seasonwinter -0.1062299037
## sizemedium -0.1597753515
## sizesmall_ -0.0554805340
## velocitylow___ 0.1256901282
## velocitymedium 0.1827281409
## C1 0.5241878470
## C2 0.0003630096
## C3 0.1120360206
## C4 -0.0099176284
## C5 -0.0192673533
## C6 0.0772250570
## C7 0.0261154459
## C8 0.1359431662
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.734574316
## seasonspring 0.007636825
## seasonsummer -0.011666715
## seasonwinter -0.027350113
## sizemedium -0.035150827
## sizesmall_ -0.056221820
## velocitylow___ 0.081196974
## velocitymedium 0.072385938
## C1 0.201361273
## C2 -0.006115373
## C3 0.052906544
## C4 0.003021234
## C5 0.005023369
## C6 0.041438790
## C7 0.036177946
## C8 0.057914982
# and with lambda's other than default:
cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-80:80)/20))
plot(cvRidgeRes)
Similarly to what was observed for variable selection methods above, plot of cross-validation error for ridge regression has well-defined minimum indicating that some amount of regularization is necessary for the model using all attributes to prevent overfitting. Notice that minimum \(MSE\simeq 1.25\) from ridge regression here is very comparable to the minimum observed above for average test error when variables were selected by regsubsets.
Relatively higher contributions of C1 and C8 to the model outcomed are more apparent for the results of ridge regression performed on centered and, more importantly, scaled matrix of predictors:
ridgeResScaled <- glmnet(scale(x),y,alpha=0)
plot(ridgeResScaled)
cvRidgeResScaled <- cv.glmnet(scale(x),y,alpha=0)
plot(cvRidgeResScaled)
predict(ridgeResScaled,type="coefficients",s=cvRidgeResScaled$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.453178693
## seasonspring 0.003334134
## seasonsummer -0.004979054
## seasonwinter -0.012738898
## sizemedium -0.017526795
## sizesmall_ -0.026308489
## velocitylow___ 0.030675439
## velocitymedium 0.035757095
## C1 0.094910149
## C2 -0.014544578
## C3 0.059341501
## C4 0.002802531
## C5 0.006876212
## C6 0.052635676
## C7 0.040776918
## C8 0.082589538
Lasso regression is done by the same call to glmnet except that now alpha=1. One can see now how more coefficients become zeroes with increasing amount of shrinkage. Notice that amount of regularization increases from right to left when plotting output of glmnet and from left to right when plotting output of cv.glmnet.
lassoRes <- glmnet(x,y,alpha=1)
plot(lassoRes)
cvLassoRes <- cv.glmnet(x,y,alpha=1)
plot(cvLassoRes)
# With other than default levels of lambda:
cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-120:0)/20))
plot(cvLassoRes)
Also well-defined minimum of cross-validation MSE for lasso regularization.
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -2.58075850
## seasonspring .
## seasonsummer .
## seasonwinter .
## sizemedium .
## sizesmall_ .
## velocitylow___ .
## velocitymedium .
## C1 0.45361507
## C2 .
## C3 0.04081512
## C4 .
## C5 .
## C6 .
## C7 .
## C8 0.13480989
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -4.762288560
## seasonspring .
## seasonsummer .
## seasonwinter .
## sizemedium -0.001062672
## sizesmall_ .
## velocitylow___ .
## velocitymedium 0.056201304
## C1 0.674310940
## C2 .
## C3 0.092840795
## C4 .
## C5 .
## C6 0.043297784
## C7 .
## C8 0.161133952
As explained above and illustrated in the plots for the output of cv.glmnet, lambda.1se typically corresponds to more shrinkage with more coefficients set to zero by lasso. Use of scaled predictors matrix makes for more apparent contributions of C1 and C8, and to smaller degree, C3:
lassoResScaled <- glmnet(scale(x),y,alpha=1)
plot(lassoResScaled)
cvLassoResScaled <- cv.glmnet(scale(x),y,alpha=1)
plot(cvLassoResScaled)
predict(lassoResScaled,type="coefficients",s=cvLassoResScaled$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.45317869
## seasonspring .
## seasonsummer .
## seasonwinter .
## sizemedium .
## sizesmall_ .
## velocitylow___ .
## velocitymedium .
## C1 0.20081746
## C2 .
## C3 0.03376726
## C4 .
## C5 .
## C6 .
## C7 .
## C8 0.18722644
Lastly, we can run lasso on several training datasets and calculate corresponding test MSE and frequency of inclusion of each of the coefficients in the model:
lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1)
lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1)
lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 1.33891
lassoCoefCnt
## seasonspring seasonsummer seasonwinter sizemedium sizesmall_
## 0 0 0 0 0
## velocitylow___ velocitymedium C1 C2 C3
## 0 1 30 0 12
## C4 C5 C6 C7 C8
## 0 1 6 0 25
One can conclude that typical lasso model includes two, sometimes three, coefficients and (by comparison with some of the plots above) that its test MSE is about what was observed for two to three variable models as chosen by the best subset selection approach.
Using fund raising dataset from the week 4 problem set (properly preprocessed: shifted/log-transformed, predictions supplied with the data excluded) select the best subsets of variables for predicting contrib by the methods available in regsubsets. Plot corresponding model metrics (rsq, rss, etc.) and discuss results presented in these plots (e.g. what number of variables appear to be optimal by different metrics) and which variables are included in models of which sizes (e.g. are there variables that are included more often than others?).
It is up to you as to whether you want to include gender attribute in your analysis. It is a categorical attribute and as such it has to be correctly represented by dummy variable(s). If you do that properly (and above preface supplies abundant examples of doing that), you will be getting three extra points for each of the problems in this week problem set that (correctly!) included gender in the analysis for the possible total extra of 3x4=12 points. If you prefer not to add this extra work, then excluding gender from the data at the outset (as you were instructed to do for week 4) is probably the cleanest way to prevent it from getting in the way of your computations and limit the points total you can earn for this problem set to the standard 60 points.
Splitting fund raising dataset into training and test as shown above, please calculate and plot training and test errors (MSE) for each model size for the methods available for regsubsets. Using which field investigate stability of variable selection at each model size across multiple selections of training/test data. Discuss these results – e.g. what model size appears to be the most useful by this approach, what is the error rate corresponing to it, how stable is this conclusion across multiple methods for the best subset selection, how does this error compare to that of the predictions provided with the data (predcontr attribute)?
Fit lasso regression model of the outcome in fund raising dataset. Plot and discuss glmnet and cv.glmnet results. Compare coefficient values at cross-validation minimum MSE and that 1SE away from it – which coefficients are set to zero? Experiment with different ranges of lambda passed to cv.glmnet and discuss the results.
Similarly to the example shown in Preface above use resampling to estimate test error of lasso models fit to training data and stability of the variable selection by lasso across different splits of data into training and test. Use resampling approach of your choice. Compare typical model size to that obtained by the best subset selection above. Compare test error observed here to that of the predictions supplied with the data (predcontr) and the models fit above – discuss the results.